ABSTRACT

The ease of generating high-throughput data has
enabled investigations into organismal complexity at
the systems level through the inference of networks
of interactions among the various cellular components
(genes, RNAs, proteins and metabolites). The
wider scientific community, however, currently has
limited access to tools for network inference,
visualization and analysis because these tasks often
require advanced computational knowledge and expensive
computing resources. We have designed the
network portal (http://networks.systemsbiology.net)
to serve as a modular database for the integration
of user-uploaded and public data, with inference
algorithms and tools for the storage, visualization and
analysis of biological networks. The portal is fully
integrated into the Gaggle framework to seamlessly
exchange data with desktop and web applications
and to allow the user to create, save and modify
workspaces, and it includes social networking
capabilities for collaborative projects. While the
current release of the database contains networks
for 13 prokaryotic organisms from diverse phylogenetic
clades (4678 co-regulated gene modules, 3466
regulators and 9291 cis-regulatory motifs), it will be
rapidly populated with prokaryotic and eukaryotic
organisms as relevant data become available in
public repositories and through user input. The
modular architecture, simple data formats and open
API support community development of the portal.

INTRODUCTION 

Underlying the phenotype of any organism is the network
of interactions among its constituent parts encoded in its
genome. Systems biology is becoming a mature discipline
for studying this biological complexity through the use of
high-throughput instruments for data production and
powerful computing technologies for their analysis.
A central part of this effort is the reverse engineering of
gene-regulatory networks through the integration of
diverse genome-wide measurements such as gene-expression
changes, transcription-factor occupancy and
protein-protein interactions (1-3).

Biological insights revealed by gene-regulatory
networks include identification of regulators and
regulatory motifs driving particular responses, assignment of
unannotated genes to biological processes, identification
of coordinated regulation amongst cellular processes,
description of overall network architecture for an organism,
prediction of gene expression in new genetic or
environmental conditions and development of hypotheses for how
perturbation of regulators or motifs could manipulate
metabolic flux or other phenotypes. Indeed, exploration
of these networks has provided unprecedented insights
into the biology of diverse organisms, including regulation
of metabolism in bacteria (4), oxidative stress response in
Archaea (5) and vertebrate immune cell specification (6).
Further, comparative analysis of regulatory networks
from multiple species allows insights into evolutionary
changes in the roles of individual regulators, the
regulation of homologous genes and pathways and overall
network architecture features such as connectivity and
density (7,8).

While there are many algorithms for
gene-regulatory-network inference (1,2,9) as well as many tools for the
exploration and analysis of networks (10), the complexity
of these powerful tools generally makes them inaccessible
to the wider scientific community. Specifically, network
inference can be prohibitively difficult for many users
because it is usually not automated or integrated with
network-analysis tools, it requires extensive
computational power and it demands that the user have access to
a massive amount of high-quality data. Existing resources
for storage of network information, such as RegulonDB
(11), RegPrecise (12), IntegromeDB (13), DBTBS (14),
CoryneRegNet (15), MTBregList (16), InnateDB (17)
and BiologicalNetworks (18), are often limited to a few
model organisms, only store existing network models, or
are tailored for specific purposes. Similar trends during the
genomic era led to the development of universal resources
including NCBI Entrez (19), Gene Expression Omnibus
(GEO) (20) and KEGG pathway (21).

We have developed the network portal to democratize
access to the inference, storage, exploration and
visualization of gene-regulatory networks. The network portal is
connected to an automated network-inference pipeline,
which can generate networks for any organism
(prokaryotic or eukaryotic) whose genome is available using
gene-expression data from public databases or custom user files.
The network-inference pipeline is modular to allow use of
different algorithms and currently runs the cMonkey (22)
and Inferelator (23) algorithms. cMonkey integrates
gene-expression data together with genomic, proteomic and
functional associations in order to identify co-regulated
groups of genes under subsets of conditions. Inferelator
then identifies the transcription factors and environmental
conditions with the most probable regulatory influences
on these groups of genes. Inferred networks are stored
in a relational database. Network analysis is made
possible by multiple novel tools for visualization, basic
and advanced search interfaces and easy-to-use filters
to explore and analyze regulation and gene function.
The standardization within the network portal will
facilitate community development of data, algorithms and
software, allowing users to perform collaborative
analysis of raw and processed data.

MATERIALS AND METHODS 

Architecture 

Architecture overview 

The network portal is composed of four integrated layers.
The data layer collects genomic information and
gene-expression data for each organism. This layer provides
the input for the algorithm layer. Algorithm output
is stored in a PostgreSQL [http://www.postgresql.org]
database and served by a Solr/Django-powered web
interface [https://www.djangoproject.com]. The analysis and
visualization layer allows users to query and explore
networks at different levels, create and save workspaces
for in-depth analysis of networks, and broadcast data via
the Gaggle (24)/Firegoose (25) framework to third-party
desktop and web applications (Figure 1).

Database schema 

Network information is stored in a PostgreSQL relational
database. Each species is associated with genome
information and one or more inferred networks. Modules,
which are sets of co-regulated genes, are associated with
regulatory sequence motifs, conditions where gene
expression within the module is coherent, influential regulatory
or environmental factors and functional enrichment
(Supplementary Figure S1). Database releases are
publicly available at the Network Portal GitHub
repository [https://github.com/baliga-lab/network_portal].
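The relationships described above can be sketched as plain Python data classes. This is only an illustration of the species, network and module hierarchy; the class names, field names and the `modules_for_gene` helper are hypothetical and are not taken from the actual PostgreSQL schema, which is documented in the Supplementary Material and the repository.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Motif:
    name: str
    e_value: float  # motif prediction statistic


@dataclass
class Influence:
    regulator: str  # transcription factor or environmental factor
    kind: str       # "tf" or "environmental" (labels assumed for this sketch)
    weight: float   # sign distinguishes positive from negative influence


@dataclass
class Module:
    module_id: int
    genes: List[str] = field(default_factory=list)
    conditions: List[str] = field(default_factory=list)
    motifs: List[Motif] = field(default_factory=list)
    influences: List[Influence] = field(default_factory=list)
    enriched_terms: List[str] = field(default_factory=list)


@dataclass
class Network:
    name: str
    modules: List[Module] = field(default_factory=list)


@dataclass
class Species:
    name: str
    genome_genes: List[str] = field(default_factory=list)
    networks: List[Network] = field(default_factory=list)

    def modules_for_gene(self, gene: str) -> List[int]:
        """Return the ids of all modules, across networks, that contain the gene."""
        return [
            module.module_id
            for network in self.networks
            for module in network.modules
            if gene in module.genes
        ]
```

A gene may belong to more than one module, which is why the lookup returns a list rather than a single module.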

Web interface 

The network portal web interface is built with Python and
the Django framework. Keyword and faceted search is
provided by Apache Solr [http://lucene.apache.org/solr].
Other software technologies used include jQuery
[http://jquery.com], NetworkX [http://networkx.github.io],
Cytoscape Web (26) and R [http://www.r-project.org].
Full source code for the network portal is available on
GitHub at [https://github.com/baliga-lab/network_portal].

The network portal database and web interface will be 
updated and new downloads will be made available every 
6 months as we build network models for new organisms. 

Data sources 

The network portal integrates genomic information and
upstream promoter sequences from NCBI GenBank (19)
and RSAT (27), operon predictions from MicrobesOnline
(28), known and predicted protein-protein interactions
from EMBL STRING (29) and functional associations
from Prolinks (30) and Predictome (31). Functional
enrichment analyses are based on gene annotations from
the Gene Ontology (32), KEGG (21), TIGR (33) and
Cluster of Orthologous Groups (COG) (34) databases.
Lists of transcription factors for each organism for use
with the Inferelator algorithm are collected from
MicrobesOnline and JCVI CMR (33) based on GO (32)
and COG (34) annotations.

Gene-expression data were collected from
MicrobesOnline (28), GEO (20), DISTILLER (35) and
Baliga lab datasets. All downloaded data are
quality-checked computationally and manually for data integrity,
normalization and redundancy. Gene-expression matrices
for each organism were scanned for duplicate entries
and converted into log2 ratios (Redundancy filter).
Missing gene names or alternate gene names in these
matrices were fixed by constructing a synonyms table
(ProbeName filter). The data matrix was filtered to only
include columns and rows that have enough
measurements over all the conditions and genes (NoChange
filter). Furthermore, values for each row in the data
matrix were centered on their median and scaled by
their standard deviation in preparation for the network
inference (Center Scale filter). The transcription factor list
that was used as the list of putative regulatory influences
for the Inferelator algorithm was assembled using
transcription factor annotations that are supported by GO
and COG annotations. This list was further manually
curated to remove TFs with poorly defined annotations.
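As a rough illustration of the last two expression filters, the sketch below drops rows with too few measurements and then centers each remaining row on its median and scales it by its standard deviation. The function names, the use of `None` for missing values, the 50% completeness threshold and the choice of the sample standard deviation are assumptions made for this example; the text does not state the thresholds used by the pipeline.

```python
import statistics
from typing import Dict, List, Optional

Row = List[Optional[float]]  # None marks a missing measurement (assumed encoding)


def drop_sparse_rows(matrix: Dict[str, Row], min_fraction: float = 0.5) -> Dict[str, Row]:
    """Keep genes measured in at least min_fraction of the conditions."""
    kept = {}
    for gene, row in matrix.items():
        observed = sum(value is not None for value in row)
        if row and observed / len(row) >= min_fraction:
            kept[gene] = row
    return kept


def center_scale(row: Row) -> Row:
    """Center a row on its median and scale by its sample standard deviation."""
    values = [value for value in row if value is not None]
    if len(values) < 2:
        return list(row)  # not enough data to estimate a spread
    median = statistics.median(values)
    sd = statistics.stdev(values)
    if sd == 0:
        return [None if value is None else 0.0 for value in row]
    return [None if value is None else (value - median) / sd for value in row]
```

Missing values are passed through unchanged so that the column positions of the conditions are preserved.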

Network-inference pipeline 
Data processing 

Figure 1. The network portal framework. The network portal currently implements the Python-cMonkey algorithm for network inference. Publicly
available gene-expression data and genomic information are collected from various databases along with functional associations from EMBL
STRING. Conditionally co-regulated clusters of genes (modules) and motifs discovered by cMonkey are stored in the database. The most
probable influences on these modules are identified by Inferelator, using a TF list collected from MicrobesOnline and JCVI-CMR. A Django-based
Web interface dynamically creates module-centered views for Network, Functions, Genes, Regulators and Motifs. Further investigations of the
networks can be performed by using interoperability and automation frameworks provided by Gaggle and Workflow, respectively.

The automated network-inference pipeline features
automatic data download from sources including
MicrobesOnline, GEO and KBase. For network inference,
gene-expression data are organized into matrices of log2
ratios. Such matrices can be used directly when available
or computed by comparing each sample with a reference.
Genes without significant expression change (<1.5-fold) in
any of the experiments are removed. The expression level
of each gene is normalized to mean = 0 and SD = 1 as
described previously (22).
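The processing steps in this paragraph can be sketched as follows. The sketch assumes that a 1.5-fold change means an absolute log2 ratio of at least log2(1.5) in either direction and that SD refers to the sample standard deviation; both are assumptions of this example, as are the function names. The procedure actually used is the one described in reference (22).

```python
import math
import statistics
from typing import Dict, List


def log2_ratios(sample: List[float], reference: List[float]) -> List[float]:
    """Log2 ratio of each sample intensity to its matching reference intensity."""
    return [math.log2(s / r) for s, r in zip(sample, reference)]


def changes_enough(ratios: List[float], min_fold: float = 1.5) -> bool:
    """True if the gene changes by at least min_fold in any experiment."""
    threshold = math.log2(min_fold)
    return any(abs(value) >= threshold for value in ratios)


def standardize(ratios: List[float]) -> List[float]:
    """Rescale one gene's profile to mean 0 and sample SD 1."""
    mean = statistics.mean(ratios)
    sd = statistics.stdev(ratios)
    if sd == 0:
        return [0.0 for _ in ratios]
    return [(value - mean) / sd for value in ratios]


def preprocess(matrix: Dict[str, List[float]], min_fold: float = 1.5) -> Dict[str, List[float]]:
    """Drop genes without a sufficient change, then standardize the rest."""
    return {
        gene: standardize(ratios)
        for gene, ratios in matrix.items()
        if changes_enough(ratios, min_fold)
    }
```

For example, a gene with log2 ratios of 0.1, 0.0 and -0.1 never reaches the threshold of about 0.585 and is removed before standardization.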

Regulatory network inference 

The first step in the automated inference pipeline is
clustering of conditionally co-regulated genes using cMonkey
(22). We ported the cMonkey algorithm from R to Python
with enhancements in modularity and performance for use
with the network portal (W.J. Wu et al., in preparation).
Python cMonkey integrates gene-expression data with de
novo motif prediction and other functional associations
such as operon predictions, protein-protein interactions
and genomic neighborhood information to identify
groups of genes that are co-regulated under a subset of
the experimental conditions (co-regulated modules).

Second, the most probable regulatory influences from
transcription factors or environmental factors on each
co-regulated module are identified by Inferelator using linear
regression and model shrinkage techniques (23). We have
shown previously that Inferelator can predict gene
expression responses of 80% of Halobacterium salinarum genes
(36). Positive and negative influences on modules are
deposited into the database.
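To give a feel for what linear regression with model shrinkage means here, the sketch below fits an L1-penalized (lasso) linear model by coordinate descent, with the mean expression of a module as the response and candidate regulator profiles as predictors. It is a generic textbook formulation chosen for illustration and is not the Inferelator implementation, which is described in reference (23); the function names and the penalty value are assumptions.

```python
from typing import List


def soft_threshold(value: float, penalty: float) -> float:
    """Shrink a coefficient toward zero and set small ones exactly to zero."""
    if value > penalty:
        return value - penalty
    if value < -penalty:
        return value + penalty
    return 0.0


def lasso_fit(X: List[List[float]], y: List[float], penalty: float,
              n_iter: int = 200) -> List[float]:
    """Minimize (1/2n)*||y - Xb||^2 + penalty*||b||_1 by coordinate descent.

    X holds one row per condition and one column per candidate regulator;
    y is the module's mean expression in each condition.
    """
    n, p = len(X), len(X[0])
    beta = [0.0] * p
    residual = list(y)  # equals y - X @ beta while beta is all zeros
    for _ in range(n_iter):
        for j in range(p):
            column = [X[i][j] for i in range(n)]
            norm = sum(c * c for c in column) / n
            if norm == 0.0:
                continue  # constant-zero predictor carries no information
            # Covariance of regulator j with the residual that excludes its own term.
            rho = sum(c * (r + c * beta[j]) for c, r in zip(column, residual)) / n
            updated = soft_threshold(rho, penalty) / norm
            delta = updated - beta[j]
            if delta != 0.0:
                residual = [r - c * delta for c, r in zip(column, residual)]
                beta[j] = updated
    return beta
```

Regulators whose coefficient is shrunk to exactly zero are dropped from the model, and the sign of each remaining coefficient corresponds to a positive or negative influence.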

In addition to cMonkey/Inferelator, many other
powerful network-inference algorithms are available
(1,2,9). To allow users access to these other algorithms,
our architecture is designed to be modular. The central
units of network models are co-regulated modules,
their member genes and regulators with influences on
these modules. Most regulatory network-inference
algorithms provide output compatible with this framework
(see Supplementary Table S1). Therefore, developers can
easily integrate different algorithms using our API, and
users will be able to select which inference tool to use.
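One way such modularity could be expressed is a small plug-in interface in which every algorithm returns modules, member genes and regulator influences in a common shape. The sketch below is purely hypothetical: `InferenceAlgorithm`, `ModuleResult`, `register` and the demo class are invented for illustration and do not describe the portal's actual API.

```python
from abc import ABC, abstractmethod
from typing import Dict, List, NamedTuple


class ModuleResult(NamedTuple):
    genes: List[str]              # member genes of one co-regulated module
    regulators: Dict[str, float]  # regulator name -> signed influence weight


class InferenceAlgorithm(ABC):
    """Common contract that each pluggable inference algorithm would satisfy."""

    name: str

    @abstractmethod
    def infer(self, expression: Dict[str, List[float]]) -> List[ModuleResult]:
        """Turn a gene -> expression profile mapping into a list of modules."""


REGISTRY: Dict[str, InferenceAlgorithm] = {}


def register(algorithm: InferenceAlgorithm) -> None:
    """Make an algorithm selectable by name."""
    REGISTRY[algorithm.name] = algorithm


class SingleModuleDemo(InferenceAlgorithm):
    """Trivial stand-in that places every gene in one module."""

    name = "single-module-demo"

    def infer(self, expression: Dict[str, List[float]]) -> List[ModuleResult]:
        return [ModuleResult(genes=sorted(expression), regulators={})]
```

Because every algorithm returns the same structure, the storage and visualization layers would not need to know which algorithm produced a network.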



Functional enrichment 

We integrated KEGG pathway, Gene Ontology,
TIGRFam and COG annotations to maximize data
content. We use hypergeometric P-values to identify
significant overlaps between co-regulated module members
and genes assigned to a particular functional annotation
category. P-values are corrected for multiple comparisons
using the Benjamini-Hochberg procedure, and only
categories with a corrected P-value < 0.05 are retained.
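The enrichment test can be illustrated with a short, self-contained sketch. It computes the upper-tail hypergeometric probability of observing at least the given overlap and then applies the standard Benjamini-Hochberg adjustment; the function and parameter names are chosen for this example and are not taken from the portal code.

```python
from math import comb
from typing import List


def hypergeom_pvalue(population: int, annotated: int,
                     module_size: int, overlap: int) -> float:
    """P(X >= overlap) when module_size genes are drawn without replacement
    from a genome of `population` genes, `annotated` of which carry the term."""
    upper = min(annotated, module_size)
    total = comb(population, module_size)
    tail = sum(
        comb(annotated, k) * comb(population - annotated, module_size - k)
        for k in range(overlap, upper + 1)
    )
    return tail / total


def benjamini_hochberg(pvalues: List[float]) -> List[float]:
    """Return Benjamini-Hochberg adjusted P-values in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    for rank in range(m, 0, -1):  # walk from the largest P-value down
        index = order[rank - 1]
        running_min = min(running_min, pvalues[index] * m / rank)
        adjusted[index] = running_min
    return adjusted


def significant_terms(pvalues: List[float], alpha: float = 0.05) -> List[int]:
    """Indices of terms whose adjusted P-value falls below alpha."""
    return [i for i, q in enumerate(benjamini_hochberg(pvalues)) if q < alpha]
```

For instance, if all 3 genes of a module carry a term that annotates only 3 genes in a 10-gene genome, the P-value is 1/120.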

RESULTS 

Available species 

To demonstrate the flexibility of the network portal, we 
built regulatory networks for organisms from different 
phylogenies and with varying genome complexity and 
available amounts of gene expression data. The first 
release of the network portal includes two Archaea 
and 11 Bacteria, including three Firmicutes, six 
Proteobacteria, a Cyanobacterium and a Bacteroidetes 
species (Table 1). Even though the current version of the 
database includes only prokaryotic species, it is important 
to emphasize that the underlying database structure, 
network inference and web interface are also compatible 
with eukaryotic species and they will be included in the 
database as soon as they become available (37). 

Advanced search features 

The network portal is powered by an Apache Solr search
engine. Solr provides fast, faceted full-text search
capabilities for querying a large set of network features
from an integrated database. Queries can be executed
based on unique genomic, functional and network
parameters as well as ranges of values. Multi-faceted advanced
searching allows selection of specific organisms and the
ability to perform queries either at the gene level (for
'name', 'locus tag' and 'function' fields) or module level
(for 'gene members', 'regulators', 'functions' and 'residual'
ranges).

Table 1. Organisms currently in the network portal

Search results are organized based on species and
presented with a quick feature overview including
annotations, regulatory influences and module information.
Further exploration of the search results is possible by
following links. Moreover, users can visualize the
relationships among selected search results in a network diagram
with one click using the Cytoscape Web interface (26). See
Supplementary Materials Use Cases 1 and 2 for examples
of the search features to explore network biology.
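A module-level search of the kind described in this section could be expressed as a Solr request along the following lines. The request parameters (`q`, `fq`, `facet`, `facet.field`, `rows`, `wt`) and the range syntax are standard Solr, but the base URL, the core name and the field names `functions`, `organism` and `residual` are assumptions made for this sketch; the portal's actual Solr schema is not given in the text.

```python
from urllib.parse import urlencode


def build_module_query(base_url: str, organism: str, function_term: str,
                       max_residual: float, rows: int = 10) -> str:
    """Build a Solr select URL for modules of one organism that match a
    functional term and have a residual at or below max_residual."""
    params = [
        ("q", f'functions:"{function_term}"'),      # full-text match on a field
        ("fq", f'organism:"{organism}"'),           # filter to one organism
        ("fq", f"residual:[* TO {max_residual}]"),  # range filter
        ("facet", "true"),
        ("facet.field", "organism"),                # hit counts per organism
        ("rows", str(rows)),
        ("wt", "json"),
    ]
    return f"{base_url}/select?{urlencode(params)}"


if __name__ == "__main__":
    # Hypothetical local Solr core; prints the URL without contacting a server.
    print(build_module_query("http://localhost:8983/solr/modules",
                             "Halobacterium salinarum", "oxidative stress", 0.5))
```

Using filter queries for the organism and residual range keeps the main query focused on the text match, which is the usual way faceted interfaces narrow a result set.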

Visualizations 

The module page shows co-expression profiles
(Figure 2A), member genes (Supplementary
Figure S2A), transcription factors and environmental
factors as regulatory influences (Figure 2A and
Supplementary Figure S2B), and de novo identified
motifs (Figure 2B). A network view of the module
created using Cytoscape Web (26) enables interactive
exploration (Figure 2C). In this view, module member genes,
motifs and regulatory influences are represented as
peripheral nodes connected to core module nodes via edges. For
each module, regulatory influences are listed in tables
(Supplementary Figure S2B).

Transcription factor binding motifs help to elucidate
regulatory mechanisms. cMonkey integrates the MEME
Suite (39) for de novo motif detection. Motifs for each
module are listed as logo images along with prediction
statistics (E-values) and the location of motifs within the
upstream sequences of the module member genes
(Figure 2A). Motifs can be broadcast to RegPredict
(http://regpredict.lbl.gov/regpredict) in order to compare
conservation in similar species. This integrated motif
prediction and comparative analysis provides an additional
checkpoint for regulatory motif prediction confidence.

Identification of functional enrichment for the module
members is important in associating predicted motifs
and regulatory influences with pathways.
Over-represented functional ontology terms from KEGG, GO,
TIGRFAM and COG are presented for each module
along with hypergeometric P-values and the number of
module genes assigned to each term. See Supplementary
Materials Use Case 3 for an example of the use of
functional enrichment information in exploration of gene
function.

Gene-landing pages present genomic, functional and
regulatory information for individual genes. A circular
visualization displays connections between the selected
gene and genes in the same modules, with edges drawn
between the respective coordinates of the whole genome.
The gene page also lists functional ontology assignments,
module membership and motifs associated with these
modules. Genes in the network inherit regulatory
influences from the modules to which they belong, and the
regulatory influences table lists influence name, type and
target module. If the gene is a transcription factor, its
target modules are displayed in a table with residual
values and number of genes (see Supplementary
Material Movie 1 for an example use case of
gene-landing pages).

Interoperation with other desktop and web applications 
through Gaggle 

We previously developed the Gaggle framework
for exchanging data among independent desktop
programs (e.g. Cytoscape, MeV, Firefox, R) and web
resources (e.g. EMBL STRING, KEGG, STAMP,
MicrobesOnline, RegTransBase, RegPrecise, KBase and
DAVID) (24). We fully integrated the network portal
with the Gaggle framework to extend analysis capabilities
and interoperability with other resources that are not
included in the portal itself. The Firefox extension
Firegoose can capture multiple data types (e.g. NameList
and matrix) and then broadcast these data to other
resources, and data from outside web and desktop
applications can be broadcast into the network portal
Workspace. See Supplementary Materials Use Case 4
for a case study in obtaining additional data about
genes of interest using the Firegoose to connect to
outside webpages.
Figure 2. An example module page. (A) The landing page for each module presents a summary view of the module, including an interactive plot of
gene-expression profiles across conditions, motif locations upstream of the member genes and summary statistics. Tabs located on top of the page
provide access to other visualization tools. (B) The motif table shows motif logos for de novo identified upstream regulatory motifs and E-value
statistics. For selected organisms, a link to analyze the motifs using RegPredict is provided. (C) Interactive network visualization is created by using
Cytoscape Web. An edge connects the module and each of its gene members, motifs and regulators. Clicking on a node opens an overlay window
with detailed information.
Gaggle workspace

The network portal also acts as a gateway for the Gaggle
Workspace (N. Jiang et al., submitted for publication).
Gaggle Workspace provides a central space for
managing data and an entry point for network analysis.
Workspace can be used to integrate user-uploaded data
and data available in the network portal. For each
organism, it provides Cytoscape network and MeV
gene-expression files as Java Web Start applications, enabling
cross-platform analysis. Gaggle/Firegoose can be used to
capture information such as module member gene lists
into the Workspace and to broadcast data to other
applications or web resources. One of the unique features of the
Workspace is the ability to save the state of the analysis,
including analysis steps, associated data and the results for
later access or sharing. See Supplementary Material
Movie 2 for a case study using Workspace.

Integration of outside resources 

The network portal is designed to be extensible and
modular to give users flexibility to choose different tools
for inference and analysis and for developers to integrate
their resources. As a proof of concept, we integrated
RegPredict motif analysis into the network portal
modules for Desulfovibrio vulgaris. RegPredict carries
out module inference by searching a known position
weight matrix (PWM) against genomes of closely related
prokaryotes. The RegPredict link sends the user from a de
novo identified motif within the network portal to the
RegPredict website, allowing comparison of two
independent motif detection methods. This seamless
integration enables further exploration of predicted motifs to
check their evolutionary conservation across multiple
taxonomically related genomes.

DISCUSSION 

The network portal improves the availability of regulatory
information by implementing network-inference
algorithms and novel visualization tools. The first release
provides gene-regulatory network models for 13 microbial
species of medical, biotechnological and environmental
importance. The network portal will be rapidly
expanded to include the >100 organisms for which there
is already sufficient gene expression data available in
public databases for robust regulatory network inference.
As more networks become available, network
comparisons become possible among species that vary by
phylogenetic relationship, environmental niche or metabolic
and phenotypic features.

Moreover, the network portal promotes cross-platform 
data analysis and collaboration among researchers with 
distinct areas of expertise. To this end, the network 
portal framework integrates the Gaggle framework and 
will allow developers to add other inference algorithms. 
Further, the new Workspace application enables users to 
upload data, capture information from the web and save 
analysis states, and future releases will add capabilities to 
create projects, workflows, favorites and bookmarks and 
share these features with collaborators. 